Abstract


This is a further study on the severe acute respiratory syndrome (SARS) using the probabilistic models. The purpose was to define the 
potential targets for anti-SARS drugs in the structural proteins from human SARS related coronavirus (SARS-CoV) while knowing little 
about the functional sites and possible mutations in these proteins. From a probabilistic viewpoint, we can theoretically select the amino 
acid pairs as potential candidates for anti-SARS drugs. These candidates have a greater chance of colliding with anti-SARS drugs, are 
more likely to link with the protein functions and are less vulnerable to mutations. 


© 2004 Elsevier Inc. All rights reserved. 
1. Introduction


To deal with a possible recurrence of the severe acute 
respiratory syndrome (SARS), the determination of targets 
in SARS related coronavirus (SARS-CoV) is important 
and pressing for the development of anti-SARS drugs 
[1,3,7,20]. Without sufficient knowledge of the SARS-CoV 
at present, it is quite difficult to define the potential targets 
in SARS-CoV for anti-SARS drugs. To solve this problem,
several approaches have been taken to analyze the potential
targets in SARS-CoV, such as the determination of binding
sites in SARS-CoV [6], alignment and multiple comparison
against the protein data bank with Blastp and other computer
software [5,11], and the prediction of mutation sites in the
spike protein from SARS-CoV [35].

Nevertheless, it is still necessary to use different ap- 
proaches to discover the potential targets in SARS-CoV for 
drug design. The structural proteins from SARS-CoV can 
be primarily considered the potential targets [1], because 
their functions are comparatively clear. Here we focus on
five structural proteins from SARS-CoV, i.e.
the replicase, spike, envelope, membrane and nucleocapsid
proteins. These proteins are similar in function to those of
other coronaviruses. The replicase polyprotein is a
multifunctional protein containing the activities necessary
for the transcription of negative stranded RNA, leader RNA, subgenomic
mRNAs and progeny virion RNA as well as proteinases 
responsible for the cleavage of the polyprotein into func- 
tional products. The spike protein is responsible for binding 
to receptors on host cells and for membrane fusion. The 
envelope and membrane glycoproteins are components of
the viral envelope that play a central role in virus
morphogenesis and assembly via their interactions with other
viral proteins. The nucleocapsid protein is the major
structural component of virions that associates with genomic
RNA to form a helical nucleocapsid [4,7,10,14,15].

Without detailed knowledge of functional sites in the 
structural proteins in SARS-CoV, a probabilistically simple 
approach for drug efficacy is to target the abundant amino 
acids in these proteins. We could expect that the anti-SARS 
drugs have a greater chance to interact with SARS-CoV if 
the collision between anti-SARS drugs and structural pro- 
teins in SARS-CoV is a random event [28]. This being the 
case, there are two problems: (i) a single amino acid from 
abundant groups does not directly represent the functional 
groups in proteins, because a good signature pattern of a 
protein must be as short as possible, but the conserved se- 
quence is not longer than four or five residues [13]; and (ii) 
even if these amino acids would be located at the functional 
sites, there is still a chance of mutation at these amino acids 
leading to the inefficacy of anti-SARS drugs [17]. 

Therefore, further effort is made to find potential func- 
tional sites with small mutation possibility in the structural 
proteins from SARS-CoV. Over the last few years we have 
developed three models using random approaches to analyze
the functional amino acid pairs in proteins and to evaluate 
the mutation effects on amino acid pairs [30]. Generally, in 
terms of actual and predicted frequencies, our models clas- 
sify the amino acid pairs in a protein into two categories: 
the randomly predictable and the unpredictable. First, our 
models suggest that the amino acid pairs with large differ- 
ences between actual and predicted frequencies should be 
deliberately developed and probably be located at the func- 
tional sites, as the random construction of an amino acid 
pair is the least time- and energy-consuming. For the func- 
tional purpose, nature should have the intention to spend 
more time and energy to construct amino acid pairs with 
large differences between actual and predicted frequencies 
[24-26,29]. Second, our models reveal that the mutations
are likely to occur at the amino acid pairs, whose actual 
frequency is larger than their predicted frequency [31-34]. 
Finally, our models demonstrate that the amino acid pairs 
with high Markov transition probability are less sensitive to 
mutations [21-23,27].

Our models suggest that the ideal targets are based not 
only on the abundance of amino acids, but also on the po- 
tentially functional sites as well as on a small chance of 
mutations. In this study we used our models to analyze five 
structural proteins from SARS-CoV to determine the poten- 
tial targets for anti-SARS drugs. 


2. Materials and methods 


The amino acid sequences of the replicase (accession
number: P59641), spike (accession number: P59594), envelope
(accession number: P59637), membrane (accession number: P59596)
and nucleocapsid (accession number: P59595) proteins were
obtained from the SWISS-PROT data bank [2].


2.1. Determination of potentially functional amino acid 
pairs 


For the determination of the potentially functional amino 
acid pairs in the structural proteins, the actual and predicted 
frequencies of amino acid pairs were calculated and the dif- 
ference between them compared. The detailed calculations 
and their rationales with examples are described below. 


2.1.1. Amino acid pairs in SARS-CoV spike protein 

The spike protein from human SARS-CoV consists of 
1255 amino acids. We count the first and second amino acids 
as an amino acid pair, the second and third as another pair, 
the third and fourth, until the 1254th and 1255th, thus there 
are 1254 pairs. As there are 20 types of amino acids and
any amino acid pair can be composed of any of these 20
types, there are theoretically 400 possible
types of amino acid pairs. The 1254 pairs in
the spike protein outnumber the 400 theoretically
possible types, so some of the 400 types must appear
more than once. Meanwhile, it is reasonable to expect that
some of the 400 types are absent from the spike protein.
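
The counting of overlapping pairs described above can be sketched in a few lines of Python. This is only an illustration: the function name `count_pairs` and the toy sequence are hypothetical and are not taken from the spike protein.

```python
from collections import Counter

def count_pairs(sequence: str) -> Counter:
    """Count overlapping adjacent pairs: residues 1-2, 2-3, ..., (n-1)-n."""
    return Counter(sequence[i:i + 2] for i in range(len(sequence) - 1))

# Hypothetical toy sequence in one-letter code (NOT the spike protein).
toy = "MAARSAAV"
pairs = count_pairs(toy)

print(sum(pairs.values()))  # 7 pairs for 8 residues (n - 1)
print(pairs["AA"])          # 2, so "AA" appears more than once
print(400 - len(pairs))     # 394 of the 400 possible types are absent
```

A sequence of n residues always yields n - 1 pairs, which is why the 1255 residues of the spike protein give 1254 pairs.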


2.1.2. Actual frequency and randomly predicted frequency 
in SARS-CoV spike protein 

The randomly predicted frequency is governed by the
simple permutation principle [8]. For instance, there are 39
arginines (R) and 96 serines (S) in the spike protein. The
predicted frequency of the amino acid pair “RS” would be 3
((39/1255) × (96/1254) × 1254 = 2.983). In fact we can
find three “RS” pairs in the spike protein, so the actual frequency
of “RS” is 3.
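
As a check on the arithmetic, the predicted frequency of “RS” can be reproduced with a short Python sketch. The function name `predicted_frequency` is an illustrative assumption; the counts (39 R, 96 S, 1255 residues) are those quoted in the text.

```python
def predicted_frequency(n_first: int, n_second: int, length: int) -> float:
    """Expected count of an ordered pair of two different residue types.

    Mirrors the expression in the text:
    (n_first / length) * (n_second / (length - 1)) * (length - 1),
    where length - 1 is the number of pair positions.
    """
    n_pairs = length - 1
    return (n_first / length) * (n_second / n_pairs) * n_pairs

# Counts quoted in the text for the spike protein.
pf_rs = predicted_frequency(39, 96, 1255)
print(round(pf_rs, 3))  # 2.983
print(round(pf_rs))     # 3, equal to the actual frequency of "RS"
```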


2.1.3. Randomly predictable present amino acid pairs 

As described in the last section, the predicted frequency 
of a randomly present amino acid pair “RS” would be 3 
and “RS” really appears three times in the protein, so the 
presence of “RS” is randomly predictable. 


2.1.4. Randomly unpredictable present amino acid pairs 

There are 84 alanines (A) in the spike protein, so the
frequency of a random presence of the amino acid pair “AA” would
be 6 ((84/1255) × (83/1254) × 1254 = 5.555), i.e. there
would be six “AA” pairs in the spike protein. But in fact “AA”
appears 10 times in the protein, so the presence of “AA”
is randomly unpredictable. This illustrates the case in which the
actual frequency is larger than the predicted frequency.
In the other case the actual frequency is smaller
than the predicted frequency: for example, there are 91
valines (V) in the spike protein and the predicted frequency of
“AV” is 6 ((84/1255) × (91/1254) × 1254 = 6.091), while
its actual frequency is only 3.


2.1.5. Difference between actual and predicted frequencies 

Hence, there are three relationships between the actual 
and predicted frequencies, i.e. the actual frequency is smaller 
than, equal to or larger than the predicted frequency. Our 
previous studies suggest that the amino acid pairs with a 
big difference between actual and predicted frequencies can 
be deliberately developed and probably are located at the 
functional sites. 
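
The three relationships can be illustrated with the spike protein examples “RS”, “AA” and “AV”. The sketch below assumes, as the worked examples suggest, that the predicted frequency is rounded to the nearest integer before comparison; the function names and the `identical` flag are illustrative and not part of the original method description.

```python
def predicted_frequency(n_first, n_second, length, identical=False):
    """Expected pair count under random arrangement.

    For a pair of identical residues (e.g. "AA") the second residue is
    drawn from the n - 1 remaining ones, as in (84/1255) x (83/1254) x 1254.
    """
    second = n_second - 1 if identical else n_second
    n_pairs = length - 1
    return (n_first / length) * (second / n_pairs) * n_pairs

def difference(actual, predicted):
    """Return AF - PF with PF rounded to the nearest integer (an assumption)."""
    return actual - round(predicted)

LENGTH = 1255  # residues in the spike protein, from the text
examples = {
    # pair: (count of first, count of second, identical?, actual frequency)
    "RS": (39, 96, False, 3),
    "AA": (84, 84, True, 10),
    "AV": (84, 91, False, 3),
}
for pair, (n1, n2, same, af) in examples.items():
    pf = predicted_frequency(n1, n2, LENGTH, identical=same)
    print(pair, round(pf, 3), difference(af, pf))
# RS 2.983 0   (randomly predictable)
# AA 5.555 4   (actual larger than predicted)
# AV 6.091 -3  (actual smaller than predicted)
```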


2.2. Calculation of Markov transition probability 


To minimize the chance that mutations occur at the
targets of anti-SARS drugs, we need to find the amino acid
pairs with high Markov transition probability. The Markov
transition probability is the probability of moving from one
state to another [18]. In an amino acid pair, an amino
acid has a certain probability of following a given preceding
amino acid, which constitutes a conditional probability (the
first-order Markov chain), i.e. the probability that an amino
acid occurs second in a pair given a certain first amino
acid [P(second amino acid|first amino acid)]. The calculation
of this probability is the transition from the state of one
amino acid to the state of an amino acid pair. More practically,
in the case of the English language, the first-order
Markov chain analyzes the probability, for example, that
“e” follows “w” among the 26 available letters. Using
the spike protein again as an example, there are 39 “R”s and 39
“C”s in this protein. Intuitively, “R” and “C” would have
the same probability of following “N”, but the Markov transition
probability shows that “C” is more likely than “R”
to follow “N”, so the amino acid pair “NC”
is more stable and less sensitive to mutations than “NR”.
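
A minimal sketch of a first-order transition probability is given below. The text does not spell out the estimator, so the usual maximum-likelihood form is assumed here: P(second|first) = count of the pair divided by the number of times the first residue occupies the first position of any pair. The function name and the toy sequence are hypothetical.

```python
from collections import Counter

def transition_probabilities(sequence: str) -> dict:
    """Estimate P(second | first) for every adjacent pair in the sequence."""
    pair_counts = Counter(sequence[i:i + 2] for i in range(len(sequence) - 1))
    # The last residue never starts a pair, so it is excluded from the totals.
    first_counts = Counter(sequence[:-1])
    return {pair: n / first_counts[pair[0]] for pair, n in pair_counts.items()}

# Hypothetical toy sequence (NOT the spike protein).
probs = transition_probabilities("NCNCNRAN")
print(round(probs["NC"], 3))  # 0.667
print(round(probs["NR"], 3))  # 0.333
```

In this toy case “C” follows “N” twice as often as “R” does, which is the kind of asymmetry the “NC” versus “NR” comparison refers to.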


2.3. Statistics 


The actual frequency and predicted frequency can be
compared as follows. Generally each of the 20 types of amino
acids has a chance of 1/20 (P = 0.05) to repeat once, and an
amino acid pair has a chance of 1/400 (P = 0.0025) to repeat
once in the protein primary structure. In the case of the spike
protein from human SARS-CoV, there are 99 threonines (T)
and 99 leucines (L), the most abundant amino acids, and 11
tryptophans (W), the least abundant amino acid. If the first amino
acid is “T”, then the chance of the second amino acid being
“T” is 98/1254 (P = 0.078 > 0.05). If the first amino acid is
“W”, then the chance of the second amino acid being “W” is
10/1254 (P = 0.008 < 0.01). Accordingly the chance of the
first amino acid pair being “TT” is (99/1255) × (98/1254)
(P = 0.0062 < 0.01), and the chance of the second amino
acid pair being “TT” is (97/1253) × (96/1252) (P =
0.0059 < 0.01). For the least frequent amino acid “W”, the
chance of the first amino acid pair being “WW” is (11/1255) ×
(10/1254) (P = 0.00007 < 0.001), and the chance of the
second amino acid pair being “WW” is (9/1253) × (8/1252)
(P = 0.00005 < 0.001). For that reason, the probability
is <0.05 if the difference between the actual and predicted
frequencies of any amino acid pair is
greater than or equal to one. This statistical consideration
is also applied to the Markov chain transition probability.

Fig. 1. Repetition of amino acid pairs in the structural proteins from SARS-CoV.
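
The “TT” and “WW” probabilities quoted above follow from drawing two residues without replacement. The sketch below reproduces them; the function name `pair_chance` is an illustrative assumption, and the counts are those given in the text.

```python
def pair_chance(remaining: int, positions: int) -> float:
    """Chance that two consecutive draws without replacement give the same
    residue type: (k / n) * ((k - 1) / (n - 1))."""
    return (remaining / positions) * ((remaining - 1) / (positions - 1))

# Threonine: 99 residues among 1255; two are used up by the first "TT".
print(round(pair_chance(99, 1255), 4))  # 0.0062
print(round(pair_chance(97, 1253), 4))  # 0.0059

# Tryptophan: 11 residues among 1255; two are used up by the first "WW".
print(round(pair_chance(11, 1255), 5))  # 7e-05
print(round(pair_chance(9, 1253), 5))   # 5e-05
```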


3. Results 


To determine the amino acid pairs which have a greater
chance of random collision with the anti-SARS drugs, we
count the frequency of each amino acid pair in these
proteins. Fig. 1 shows the number of amino acid pairs versus
their frequency. For example, the bottom panel indicates that
105 amino acid pairs appear once, 46 pairs twice, 27 pairs
three times, 14 pairs four times, seven pairs five times, two
pairs six times and one pair eight times in the nucleocapsid
protein. Clearly, the frequency of amino acid pairs is higher
in the replicase protein than in the other four proteins.
Although frequently appearing amino acid pairs have a
greater chance of interaction with anti-SARS drugs, they are
possibly not located at functional sites or well exposed to
anti-SARS drugs. The differences between actual and
predicted frequencies were calculated in order to determine the
amino acid pairs with high probability at functional sites
(Fig. 2). Comparing the amino acid pairs in Figs. 1 and 2, it
can be found that some of the frequently appearing amino
acid pairs are not the ones with large differences between
actual and predicted frequencies. For example, the amino acid
pair “LL” appears 65 times in the replicase protein, while
the difference between its actual and predicted
frequencies is only one. This means that “LL” is unlikely to be located
at the functional sites although its frequency is the highest
in the protein.

Fig. 2. Difference between actual and predicted frequencies in the structural proteins from SARS-CoV.

Although the amino acid pairs with a big difference
between actual and predicted frequencies are more likely to be
located at functional sites, they may be subject to
mutations. Our previous studies show that the amino acid pairs
with a big difference between actual and predicted
frequencies are more sensitive to new mutations, especially the
amino acid pairs whose actual frequency is larger than their
predicted frequency [31-34]. In this case the anti-SARS
drugs would be ineffective if their targets undergo mutations.
But the anti-SARS drugs will still have effects if the targets
are the amino acid pairs formed through mutations. So the
amino acid pairs whose actual frequency is smaller than their
predicted frequency are preferred, as they are more likely to
be formed through mutations [31-35]. In addition, this does
not require knowledge of the possible outcome of a
mutation.

To find the amino acid pairs less sensitive to mutations,
we calculate the first-order Markov transition probability
(Fig. 3). For instance, the amino acid pairs “AN” and “PL”
have the same values of both actual frequency (AF = 31)
and predicted frequency (PF = 26) in the replicase protein,
but a difference can be found in their Markov transition
probability, which is 0.061 for “AN” and 0.113 for “PL”.
Therefore, the amino acid pair “PL” rather than “AN” is
preferable as the potential target for drugs, because “PL”
is more stable than “AN”.

Fig. 3. The first-order Markov transition probability in the structural proteins from SARS-CoV.

Taking the above three factors into account, we can
theoretically define the possible targets for the development of
anti-SARS drugs. Table 1 lists the amino acid pairs which
can serve as the potential targets for anti-SARS drugs
if about 10% of the amino acid pairs are chosen in each
protein.

Table 1
Some possible targets for anti-SARS drugs in structural proteins from
human SARS-CoV

Protein        Pair   Repetition   AF-PF   Markov probability
Replicase      LV     41           -14     0.061
               GL     29           -11     0.069
               LT     40           -7      0.059
               KV     24           -10     0.058
               TA     21           -15     0.042
               DL     34           -4      0.086
               VS     31           -7      0.053
               TV     37           -4      0.075
               VG     24           -10     0.041
Spike          LS     5            -3      0.051
               SL     6            -2      0.063
               TL     6            -2      0.061
               FL     2            -2      0.06
               TV     4            -3      0.041
               LA     4            -3      0.04
Envelope       GQ     3            -1      0.067
               GS     3            -1      0.067
               KG     2            -1      0.069
Membrane       IL     1            -2      0.056
               AL     1            -2      0.053
Nucleocapsid   VL     1            -2      0.077
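
One way to combine the three factors is to filter and rank candidate pairs, as in the sketch below. The rows are a subset of the replicase entries of Table 1; the thresholds `min_repetition` and `min_markov` and the ranking by Markov probability are hypothetical choices made for illustration, since the text only states that about 10% of the pairs were chosen in each protein.

```python
from typing import NamedTuple

class Pair(NamedTuple):
    name: str
    repetition: int  # actual frequency (AF)
    diff: int        # AF - PF
    markov: float    # first-order Markov transition probability

# A subset of the replicase rows of Table 1.
rows = [
    Pair("LV", 41, -14, 0.061),
    Pair("GL", 29, -11, 0.069),
    Pair("LT", 40, -7, 0.059),
    Pair("TA", 21, -15, 0.042),
    Pair("DL", 34, -4, 0.086),
]

def select(candidates, min_repetition, min_markov):
    """Keep frequent pairs with AF < PF and a high Markov probability,
    then rank them by Markov probability (most stable first)."""
    kept = [c for c in candidates
            if c.repetition >= min_repetition
            and c.diff < 0
            and c.markov >= min_markov]
    return sorted(kept, key=lambda c: c.markov, reverse=True)

# Hypothetical thresholds, chosen only to make the example concrete.
chosen = select(rows, min_repetition=25, min_markov=0.05)
print([c.name for c in chosen])  # ['DL', 'GL', 'LV', 'LT']
```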


4. Discussion 


In this study we outline the selection of the potential
targets for anti-SARS drugs in the structural proteins from
SARS-CoV at a stage when there is no detailed knowledge of these
proteins, their functional sites or their mutation
patterns.

In such a situation, we can assume that the interaction 
between anti-SARS drugs and SARS-CoV is a random 
collision [28]. The abundant amino acids in the structural 
proteins from SARS-CoV would have a greater chance to 
collide with anti-SARS drugs [9,12] if each amino acid had 
an equal chance to be exposed to the drugs. In this manner 
we would have the first group of candidates as potential 
targets for anti-SARS drugs. 

Although a single amino acid can be the target of
anti-SARS drugs [19], the targeted amino acid (except for
one at a terminus) is connected to two neighboring
amino acids, and these connections form two amino acid
pairs. We therefore regard the amino acid pair as the
basic unit for analysis, following the concept that a good
signature pattern of a protein must be as short as possible,
but the conserved sequence is not longer than four or five
residues [13]. The search for abundant amino acids
thus becomes a search for abundant amino acid pairs.
Taking the frequency of amino acid pairs in proteins as a
measure, we scale down the search coverage of candidates
as potential targets for anti-SARS drugs. This group of
candidates forms the open circle in Fig. 4.
Ideally this circle should only contain the amino acid pairs
exposed to anti-SARS drugs.

Fig. 4. The ideal targets (intersection among three circles) for anti-SARS drugs in relation to the amino acid pairs (open circle) grouped according to
their frequencies in proteins, the amino acid pairs (lined circle) grouped according to the difference between actual and predicted frequencies and the
amino acid pairs (gray circle) grouped according to their first-order Markov probability.

Although the amino acid pairs selected by abundance have 
a greater chance to interact with anti-SARS drugs, the abun- 
dance does not directly represent the functional activity of 
amino acid pairs. We need to further narrow down the search
coverage by calculating the difference between actual and
predicted frequencies, which forms the lined circle
in Fig. 4. Ideally, the lined circle should only include the
amino acid pairs at functional sites in proteins. The amino 
acid pairs with a big difference between actual and pre- 
dicted frequencies from the abundant amino acid pairs con- 
stitute the intersection between the open and lined circles in 
Fig. 4. 

Finally, we cannot neglect the possible mutations in
SARS-CoV, which may lead to the inefficacy of anti-SARS
drugs as shown in drug development [16]. The first-order
Markov transition probability determines the stability of
amino acid pairs, which are grouped in the gray circle in
Fig. 4. As a result, we have the candidates which are more
likely to be at functional sites and less vulnerable to mutations.
These candidates are enclosed in the intersection between
the lined and gray circles. We also have the candidates
in the intersection between the open and gray circles.
They not only appear more frequently, but also are
less vulnerable to mutations. After balancing the three
factors, our search converges on the candidates enclosed
in the intersection among the three circles, which have a great
chance to collide with anti-SARS drugs, are more
likely to link with the functional sites in the structural
proteins and are less vulnerable to mutations. Our probabilistic
model provides, at least partly, a conceptual framework
of how to select potential targets for design of anti-SARS
drugs.
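
The three circles of Fig. 4 can be pictured as sets of amino acid pairs, with the ideal targets as their common intersection. The memberships below are hypothetical and serve only to show the set operations; they are not results of this study.

```python
# Hypothetical memberships for the three groups in Fig. 4.
abundant = {"LL", "LV", "DL", "AN", "PL"}          # open circle: frequent pairs
large_difference = {"LV", "DL", "AN", "PL", "TA"}  # lined circle: large |AF - PF|
stable = {"DL", "PL", "LL", "TA"}                  # gray circle: high Markov probability

# Pairwise intersections discussed in the text.
functional_and_stable = large_difference & stable
abundant_and_stable = abundant & stable

# Ideal targets: the intersection among all three circles.
ideal = abundant & large_difference & stable

print(sorted(functional_and_stable))  # ['DL', 'PL', 'TA']
print(sorted(abundant_and_stable))    # ['DL', 'LL', 'PL']
print(sorted(ideal))                  # ['DL', 'PL']
```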